From data strategy to implementation to advance cancer research and cancer care: A French comprehensive cancer center experience

In a comprehensive cancer center, effective data strategies are essential to evaluate practices, and outcome, understanding the disease and prognostic factors, identifying disparities in cancer care, and overall developing better treatments. To achieve these goals, the Center Léon Bérard (CLB) considers various data collection strategies, including electronic medical records (EMRs), clinical trial data, and research projects. Advanced data analysis techniques like natural language processing (NLP) can be used to extract and categorize information from these sources to provide a more complete description of patient data. Data sharing is also crucial for collaboration across comprehensive cancer centers, but it must be done securely and in compliance with regulations like GDPR. To ensure data is shared appropriately, CLB should develop clear data sharing policies and share data in a controlled, standardized format like OSIRIS RWD, OMOP and FHIR. The UNICANCER initiative has launched the CONSORE project to support the development of a structured and standardized repository of patient data to improve cancer research and patient outcomes. Real-world data (RWD) studies are vital in cancer research as they provide a comprehensive and accurate picture of patient outcomes and treatment patterns. By incorporating RWD into data collection, analysis, and sharing strategies, comprehensive cancer centers can take a more comprehensive and patient-centered approach to cancer research. In conclusion, comprehensive cancer centers must take an integrated approach to data collection, analysis, and sharing to enhance their understanding of cancer and improve patient outcomes. Leveraging advanced data analytics techniques and developing effective data sharing policies can help cancer centers effectively harness the power of data to drive progress in cancer research.


Introduction
Cancer is a disease with multidimensional complexity that affects millions of people worldwide, and it requires a comprehensive approach for effective treatment and management [1,2].Comprehensive cancer centers play a crucial role in providing state-of-the-art cancer care and conducting cutting-edge research to advance our understanding of the disease [3].To effectively fulfill these roles, comprehensive cancer centers need to adopt effective data strategies that enable the efficient collection, analysis, and use of data [4,5].Contending with a continuously expanding volume and a variety of clinical data poses challenges and opportunities for a comprehensive cancer center.For several years, scientific cancer societies have been able to describe their data strategies, but essentially on a national scale and with a very (too?) theoretical approach [6][7][8][9].This theoretical and generalist approach is essential because it must be shared, but its practical application can most often only be developed at the institutional level.
Data from EMR and all databases are a crucial asset in comprehensive cancer centers, as it provides valuable information about patient care, disease progression, and treatment outcomes.This information can be used to inform clinical decision-making, improve patient outcomes, and drive research advancements and even more with the rise of artificial intelligence [10,11].Furthermore, data can also be used to monitor and evaluate the performance of cancer centers, which is essential for quality improvement initiatives and accreditation [12,13].It is therefore essential for such hospital structures to define a strategy for collecting and analyzing data, as well as sharing health data in compliance with the regulatory constraints in force.
In this article, we present the global data strategy of a French comprehensive cancer center, providing concrete examples of practical implementation.Our aim is to share the advantages and difficulties of this institutional organization, illustrating how a well-designed data strategy can contribute to improved patient care and research advancement in the complex world of cancer treatment and management.

Centre Le ´on Be ´rard
The French comprehensive cancer center Le ´on Be ´rard, located in the Rho ˆne-Alpes region, launched in 1989 a pilot phase to implement a single EMR procedure for each cancer patient.This procedure was adopted from January, 1993, and computerized medical observations and hospitalization reports, then progressively integrated prescriptions of chemotherapy and blood products since 1996, external medical reports since 2000, and additional files since 2005 such as medical imaging, histology results, other medication, etc.In July 2002, EMR became the reference file, and a full electronic format for medical records was definitively adopted in 2006 with regular updates performed on a daily basis.Since 2015, a patient portal « myCLB » has allowed cancer patients to access their medical records, schedule appointments, and report their symptoms on online forms.Computerization of medical records opens up opportunities for profound change in data sources availability on extended medical information on cancer management.For this purpose, the CLB has an information system directorate, which includes a unit managing IT infrastructure and cybersecurity (10 persons), a unit managing internal development of IT solutions and interfaces with business software (10 persons), and since 2021, a data factory (10 persons) aimed at promoting data structuring and facilitating its reuse.The whole is directed by a couple doctor/engineer allowing to have a dual approach and requiring a close collaboration (S1 Fig) .Fig 1 represents all the computer software used, specifying their years of installation and whether they were developed in internally by our IT department or purchased from an external service provider.For example, the Biological Resources Center and the information system allow us to specify, for each patient, all quantities of biological materials (tumors, blood, plasma, etc.) that are still available.The medical review form is a simple digital form that nevertheless enables the collection of digital data (dermatology photos, indications for oncogenetic investigations, or specific symptoms for gynecological consultation requests.).

Tools
CONSORE.Since 2015, CLB actively participated in the development and deployment of CONSORE [14].The CONSORE project is a component of the UNICANCER initiative, a network of comprehensive cancer centers in France.The goal of the CONSORE project is to develop a decentralized and standardized repository of patient data to support cancer research and improve patient outcomes.CONSORE aims to use data from multiple sources, including electronic medical records (EMRs), pathology reports, and clinical trial data, to create a comprehensive and up-to-date view of patient data.The project uses advanced data analytics and NLP (Natural Language Processing) techniques to extract, structure, and analyze data, with the goal of providing insights into patient outcomes, treatment patterns, and other relevant information.CONSORE represents an important step forward for comprehensive cancer centers in France, as it will enable them to better understand the disease, identify disparities and unmet needs in cancer care, and develop more effective treatments.By leveraging the collective resources of the UNICANCER network, the CONSORE project has the potential to greatly enhance our understanding of cancer and improve patient outcomes.
OSIRIS.One approach to standardized data collection and analysis in comprehensive cancer centers is the use of common data models, such as the OSIRIS model [15].The OSIRIS model is a data model specifically designed for oncology research and provides a standardized framework for collecting and analyzing cancer data.By using a common data model like OSI-RIS, comprehensive cancer centers can ensure that their data collection, analysis, and sharing strategies are consistent and comparable.This can facilitate the sharing of data between institutions and researchers and enable the development of more robust and informative analysis techniques.By adopting a common data model, comprehensive cancer centers can take a collaborative approach to cancer research and improve our understanding of this complex disease.

General Data Protection Regulation (GDPR)
In the European context, data collection, analysis, and sharing strategies in comprehensive cancer centers are governed by the General Data Protection Regulation (GDPR).The GDPR provides a framework for protecting the privacy and security of personal data, including health data.Comprehensive cancer centers must ensure that their data collection, analysis, and sharing strategies comply with the GDPR and protect patient privacy.This may involve obtaining patient consent for the collection and use of personal data, implementing appropriate security measures to protect the confidentiality of patient information, and regularly reviewing and updating data collection and analysis procedures to ensure compliance with the GDPR.Failure to comply with the GDPR can result in significant fines and reputational damage, so it is essential for a French comprehensive cancer centers to take a proactive approach to data privacy and security.To achieve this, we have established two levels of patient information.Firstly, general information is displayed in our premises, sent through our patient portal myCLB, and included in the welcome booklets provided to all new patients [16].In addition to this general information, individual patient information is included in these retrospective studies.It must be performed for each project in which the patient participates or for which the patient's data will be processed.Thus, each patient has free access to the French website (https:// mesdonnees.unicancer.fr/) to see if their data is being used for a research project and can choose to refuse or allow the reuse of their data.

Results
In parallel with the daily use of digital health data and the development or integration of new computer software, we defined a data strategy making it possible to have reliable, accessible and easily reusable data for health data research aligned with international references (FAIR guiding principles) [17].Our EMR mining process involves stages like obtaining data access, using SQL and tools for data extraction, and then refining the data, which includes both manual verification and automated processing via NLP.With CONSORE we have implemented an ETL (Extract, Transform, Load) pipeline that feeds into a structured data lake.Our next steps focus on enhancing NLP capabilities, collaborating with clinical experts, and expanding data sources (incorporing imaging data and genomic information) for a holistic patient view.This strategy is based on 4 main axes: data collection, data quality control, data analyses and data sharing (S2 Fig) .Our EMR mining Process and our next steps are described in S3 Fig.

Data collection strategies
For an effective data collection we first mined the use of EMRs, which provide a centralized repository for patient data and can be accessed by healthcare providers and researchers (pseudo-anonymously). CLB can collect data through patient-reported outcome measures (PROMs) via the portal patient myCLB, which provide valuable information about the patient experience and treatment outcomes.There are thus several ways of structuring the information of the EMR, either manually by the health professionals, or manually by the patient (essentially for adverse effects) or with the support of NLP.We have decided to primarily focus our efforts on a limited but comprehensive set of tumor-centered data.

Patients
grade � 2. These toxicities data are associated with symptoms reported by our cancer patients based on PROMs (7949 symptom forms completed by outpatients).
Molecular markers.Natural language processing can be used to extract insights from unstructured data, such as medical notes and patient report.NLP is a powerful tool for data structuration.NLP algorithms can be used to extract and categorize information from unstructured sources, such as EMRs.By using NLP, we can gain insights into patient outcomes, treatment patterns, and other relevant data that would otherwise be difficult or time-consuming to  score).Our analysis covers 98% of the pathology results from the Le ´on Be ´rard center and allows us to obtain 95% of accuracy compared to a manually created breast cancer database.However, it remains to assess the accuracy of the results of these 4 tissue biomarkers on analyzes from other hospitals and also to expand the panel of biomarkers studied.One of the limitations of using NLP (Natural Language Processing) is the need for a sufficiently large and diverse training dataset to achieve adequate model performance.For some data sources, this requires significant human effort in structuring and annotating the data.

Data quality control
Once data has been collected, it is essential to analyze it to extract meaningful insights and inform clinical decision-making.Quality control is another major point concerning data control process.Data quality can be defined as the absence of errors that matter and can be achieved with robust and documented processes.For example, we compared cancer location, cancer diagnosis dates and overall survival obtained by CONSORE with a validated monitored data issued from the PROFILER clinical trial (NCT01774409) whose data was manually integrated by specialized clinical research associates including 2025 metastatic cancer patients [19].The primary goal of PROFILER trial was to analyze molecular profiles of advanced cancer patients and to recommend therapies based on genetic mutations.Thus, we can state that 98% of cancer localizations and 96% of morphologies are accurate and aligned with the International Classification of Diseases for Oncology, 3rd Edition (ICD-O-3) [20].Quality control, currently done twice a year by CONSORE during version upgrades, focuses on data for research projects, with plans to establish a structured process to track authorship and validation dates for each tumor file.

Data analysis strategies
Real-world data (RWD) studies are becoming increasingly important in the field of cancer research [21].RWD refers to data that is collected from patients in the everyday setting of their medical care, outside of the controlled environment of clinical trials.RWD provides a more complete and accurate picture of patient outcomes and treatment patterns in the real world, as opposed to data collected in clinical trials, which may not fully reflect the complexity of realworld patient populations and treatment regimens [22].RWD can also inform the development of new treatment strategies and help to identify disparities and unmet needs in cancer care.Additionally, RWD can support the development of predictive models and decisionmaking tools that can be used to optimize treatment and improve patient outcomes [23].By incorporating RWD into our data collection, analysis, and sharing strategies, we can take a more comprehensive and patient-centered approach to cancer research.This can lead to a better understanding of the disease and prognostic factors, improved patient outcomes, and the development of more effective treatments for cancer.For example, this organization has recently enabled us to publish quickly several articles on unknown or poorly studied pathologies such as COVID-19 [24,25] and the impact of vaccination, or second primary cancers [26,27].

Data sharing strategies
Sharing data is critical for advancing cancer research and improving patient outcomes.Comprehensive cancer centers can share data with other institutions and researchers through data sharing agreements and consortia.For example, the French Health data hub is a platform that enables researchers to share data and collaborate on cancer research initiatives [28].Many multicenter research projects are underway with this national platform [29][30][31].Sharing data also enables the development of predictive models and enables researchers to test and validate new hypotheses.Additionally, data sharing agreements can help to ensure that patient privacy is protected while still enabling researchers to access the information they need.By sharing data, comprehensive cancer centers can work together to advance our understanding of cancer and improve patient outcomes.For instance, the implemented organization has enabled us to participate in research projects involving various hospitals and utilizing new artificial intelligence technologies [32,33].
CLB uses quality control to ensure the absence of errors in the collected data and believes that RWD studies provide a more complete and accurate picture of patient outcomes and treatment patterns than data collected in clinical trials.The strategies for collecting, analyzing, and sharing data are essential for cancer centers [36,37].Accurate and relevant data collection is the foundation of any effective data strategy, and cancer centers can use electronic medical records and patient-reported outcome measures to gather information about patients and their treatments [38].Once data is collected, it is important to analyze it to extract actionable insights, using analysis techniques such as machine learning and natural language processing.Finally, data sharing is essential to advance cancer research and improve patient outcomes.Cancer centers can share data with other institutions and researchers through data sharing agreements and consortia [39].
In the European context, data collection, analysis, and sharing strategies are governed by the General Data Protection Regulation (GDPR), which establishes a framework for protecting the privacy and security of personal data, including health data [40].Cancer centers must ensure that their data collection, analysis, and sharing strategies comply with GDPR and protect patient privacy [41].
Data interoperability is a critical aspect of data collection, analysis, and sharing strategies.It allows different systems and applications to exchange and use data effectively.To achieve interoperability, cancer centers can adopt standards such as FHIR and OMOP [42].Adopting these standards enables cancer centers to ensure that their data is compatible with that of other institutions and researchers, facilitating the exchange and integration of data and improving the quality and relevance of cancer research.
Despite all these efforts, many elements still need to be worked on.For example, in the era of precision medicine, it is necessary to improve the completeness of our histological data and somatic and constitutional molecular biology data, which are currently very limited.The integration of NGS results with clinical data offers a more comprehensive understanding, allowing for a personalized treatment approach.Similarly, the pool of blood biology data is almost unusable, and work to structure the results and align variables with the international LOINC coding is mandatory.Additionally, digitization of pathology department is necessary so that whole slide images (WSI) become available for research projects and for the future use of artificial intelligence in routine.While Whole Slide Imaging (WSI) through digitization can provide an immense wealth of morphological details, its true potential is amplified when combined with other diagnostic methods.Thus, making WSI available is a step towards a holistic patient profile, but not an end in itself.Finally, many interoperability works, including aligning data models with OMOP/FHIR, are still necessary to facilitate international research projects.In summary, data collection, quality control, analysis, and sharing strategies are essential to improving patient outcomes and advancing cancer research.Cancer centers must ensure that their strategies comply with current standards and regulations, while adopting innovative practices to ensure the efficiency and quality of data collection and analysis.

Conclusion
Data is a critical asset for comprehensive cancer centers, and effective data strategies are essential for delivering high-quality care, advancing research, and improving patient outcomes.The adoption of electronic medical records, patient-reported outcome measures, and data sharing agreements are all key components of a successful data strategy.By implementing these strategies, comprehensive cancer centers can leverage data to drive advancements in cancer research and provide the best possible care to patients.In short, effective data strategies are essential to fighting cancer.

Table 1 . Characteristics of cancer patients.
extract from unstructured sources.To date, we have focused only on 4 theranostic biomarkers: estrogen receptor, progesterone receptor, Human epidermal growth factor receptor 2 (HER 2) by immunochemistry and fluorescence in situ hybridization and CPS (combined positive